
    Mixtures of g-priors in Generalized Linear Models

    Mixtures of Zellner's g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to Generalized Linear Models (GLMs) have been proposed in the literature; however, the choice of prior distribution of g and the resulting properties for inference have received considerably less attention. In this paper, we unify mixtures of g-priors in GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1 + g), which encompasses as special cases several mixtures of g-priors in the literature, such as the hyper-g, Beta-prime, truncated Gamma, incomplete inverse-Gamma, benchmark, robust, hyper-g/n, and intrinsic priors. Through an integrated Laplace approximation, the posterior distribution of 1/(1 + g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically, leading to "Compound Hypergeometric Information Criteria" for model selection. We discuss the local geometric properties of the g-prior in GLMs and show how the desiderata for model selection proposed by Bayarri et al., such as asymptotic model selection consistency, intrinsic consistency, and measurement invariance, may be used to justify the prior and specific choices of the hyperparameters. We illustrate inference using these priors and contrast them with other approaches via simulation and real data examples. The methodology is implemented in the R package BAS and is freely available on CRAN.
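
    As a rough illustration of how such a prior might be used in practice, here is a minimal sketch in R, assuming the BAS interface for GLMs (bas.glm with a betaprior argument and the CCH prior constructor); argument names and defaults may differ across package versions, and the simulated data are purely illustrative.

        ## Bayesian variable selection in a logistic GLM with a CCH
        ## (mixture-of-g) prior on the coefficients via BAS.
        library(BAS)

        set.seed(1)
        n  <- 200
        x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
        y  <- rbinom(n, 1, plogis(0.5 * x1 - x2))   # x3 is a null predictor
        df <- data.frame(y, x1, x2, x3)

        ## CCH(alpha, beta, s) is one member of the tCCH family on 1/(1 + g);
        ## hyper.g.n() and robust() are other choices discussed in the paper.
        fit <- bas.glm(y ~ x1 + x2 + x3, data = df, family = binomial(),
                       betaprior = CCH(alpha = 2, beta = 2, s = 0),
                       modelprior = uniform())

        summary(fit)   # posterior inclusion probabilities and top models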

    Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels

    This article describes a new class of prior distributions for nonparametric function estimation. The unknown function is modeled as a limit of weighted sums of kernels or generator functions indexed by continuous parameters that control local and global features such as their translation, dilation, modulation and shape. Lévy random fields and their stochastic integrals are employed to induce prior distributions for the unknown functions or, equivalently, for the number of kernels and for the parameters governing their features. Scaling, shape, and other features of the generating functions are location-specific to allow quite different function properties in different parts of the space, as with wavelet bases and other methods employing overcomplete dictionaries. We provide conditions under which the stochastic expansions converge in specified Besov or Sobolev norms. Under a Gaussian error model, this may be viewed as a sparse regression problem, with regularization induced via the Lévy random field prior distribution. Posterior inference for the unknown functions is based on a reversible jump Markov chain Monte Carlo algorithm. We compare the Lévy Adaptive Regression Kernel (LARK) method to wavelet-based methods using some of the standard test functions, and illustrate its flexibility and adaptability in nonstationary applications. Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/11-AOS889.
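
    As a toy illustration of the kind of expansion LARK builds on, the sketch below draws one function from a prior of this form: a Poisson number of kernels, each with its own location, scale, and weight. The Gaussian kernel and the particular hyperparameters are illustrative choices, not the Lévy measure used in the paper.

        ## One draw from a LARK-style prior:
        ## f(x) = sum_j beta_j k(x; omega_j, lambda_j)
        set.seed(2)
        J      <- rpois(1, lambda = 15)           # random number of kernels
        omega  <- runif(J)                        # locations (translation)
        lambda <- rgamma(J, shape = 2, rate = 40) # scales (dilation)
        beta   <- rnorm(J)                        # weights (Levy jumps)

        f <- function(x)
          sapply(x, function(xi) sum(beta * exp(-(xi - omega)^2 / (2 * lambda^2))))

        curve(f, 0, 1, n = 500, ylab = "f(x)")    # a single prior realization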

    Bayesian nonparametric models for peak identification in MALDI-TOF mass spectroscopy

    We present a novel nonparametric Bayesian approach based on Lévy Adaptive Regression Kernels (LARK) to model spectral data arising from MALDI-TOF (Matrix Assisted Laser Desorption Ionization Time-of-Flight) mass spectrometry. This model-based approach provides identification and quantification of proteins through model parameters that are directly interpretable as the number of proteins, the mass and abundance of proteins, and peak resolution, while having the ability to adapt to unknown smoothness as in wavelet-based methods. Informative prior distributions on resolution are key to distinguishing true peaks from background noise and resolving broad peaks into individual peaks for multiple protein species. Posterior distributions are obtained using a reversible jump Markov chain Monte Carlo algorithm and provide inference about the number of peaks (proteins), their masses and abundance. We show through simulation studies that the procedure has desirable true-positive and false-discovery rates. Finally, we illustrate the method on five example spectra: a blank spectrum, a spectrum with only the matrix of a low-molecular-weight substance used to embed target proteins, a spectrum with known proteins, and a single spectrum and an average of ten spectra from an individual lung cancer patient. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/10-AOAS450.
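
    To make the model parameters concrete, the sketch below simulates a spectrum of the general form the abstract describes: a sum of peaks, each with a mass location, an abundance, and a width governed by an instrument resolution parameter, plus noise. The Gaussian peak shape and all numbers here are illustrative placeholders, not the paper's kernel or data.

        ## Toy mean spectrum: sum of peaks with masses mu, abundances gamma,
        ## and widths mu/R set by a resolution parameter R.
        set.seed(3)
        mz    <- seq(2000, 10000, by = 5)    # m/z grid
        mu    <- c(3200, 4150, 6600)         # peak (protein) masses
        gamma <- c(1.0, 0.6, 1.4)            # abundances
        R     <- 150                         # resolution: width ~ mu / R
        spec  <- rowSums(sapply(seq_along(mu), function(j)
                   gamma[j] * dnorm(mz, mu[j], mu[j] / R)))
        obs   <- spec + rnorm(length(mz), 0, 0.02 * max(spec))  # additive noise
        plot(mz, obs, type = "l", xlab = "m/z", ylab = "intensity")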

    Bayesian model search and multilevel inference for SNP association studies

    Technological advances in genotyping have given rise to hypothesis-based association studies of increasing scope. As a result, the scientific hypotheses addressed by these studies have become more complex and more difficult to address using existing analytic methodologies. Obstacles to analysis include inference in the face of multiple comparisons, complications arising from correlations among the SNPs (single nucleotide polymorphisms), choice of their genetic parametrization, and missing data. In this paper we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations, MISA, allows computation of multilevel posterior probabilities and Bayes factors at the global, gene, and SNP level, with the prior distribution on SNP inclusion in the model providing an intrinsic multiplicity correction. We use simulated data sets to characterize MISA's statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally "validated" in independent studies. We examine sensitivity of the NCOCS results to prior choice and method for imputing missing data. MISA is available in an R package on CRAN. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/09-AOAS322.
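
    The multiplicity correction mentioned above can be made concrete with a small calculation. Under a Beta-Binomial(1, 1) prior on SNP inclusion (a common default; the hyperparameters MISA actually uses may differ), the prior odds in favor of adding one more SNP shrink with the number of candidates p, so evidence thresholds automatically tighten as more SNPs are examined. A minimal sketch in R:

        ## Prior odds of expanding a model with k of p SNPs by one more SNP
        ## under a Beta-Binomial(a, b) prior on inclusion.
        prior_odds_add <- function(k, p, a = 1, b = 1) (k + a) / (p - k - 1 + b)

        p <- 500                      # candidate SNPs after screening
        prior_odds_add(k = 0,  p = p) # 1/500: heavy penalty on the first SNP
        prior_odds_add(k = 10, p = p) # penalty relaxes as support accumulates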

    Statistical methods for automated drug susceptibility testing: Bayesian minimum inhibitory concentration prediction from growth curves

    Determination of the minimum inhibitory concentration (MIC) of a drug that prevents microbial growth is an important step for managing patients with infections. In this paper we present a novel probabilistic approach that accurately estimates MICs based on a panel of multiple curves reflecting features of bacterial growth. We develop a probabilistic model for determining whether a given dilution of an antimicrobial agent is the MIC, given features of the growth curves over time. Because of the potentially large collection of features, we utilize Bayesian model selection to narrow the collection of predictors to the most important variables. In addition to point estimates of MICs, we are able to provide posterior probabilities that each dilution is the MIC based on the observed growth curves. The methods are easily automated and have been incorporated into the Becton Dickinson PHOENIX automated susceptibility system that rapidly and accurately classifies the resistance of a large number of microorganisms in clinical samples. Over seventy-five studies to date have shown that this new method provides improved estimation of MICs over existing approaches. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/08-AOAS217.
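
    The posterior-over-dilutions idea reduces to a normalization step once each candidate dilution has a likelihood. In the hedged sketch below, the log-likelihood values are hypothetical placeholders standing in for the paper's growth-curve model; the point is only the mechanics of turning them into posterior probabilities per dilution and a point estimate.

        ## Posterior probability that each dilution is the MIC, with a uniform
        ## prior over candidate dilutions (log-likelihoods are illustrative).
        dilutions <- 2^(-3:4)                     # candidate concentrations
        loglik    <- c(-9.1, -7.8, -4.2, -1.3, -0.4, -2.9, -6.0, -8.5)
        post      <- exp(loglik - max(loglik))    # stabilized exponentiation
        post      <- post / sum(post)             # P(dilution d is the MIC | data)
        rbind(dilution = dilutions, posterior = round(post, 3))
        dilutions[which.max(post)]                # point estimate of the MIC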

    Bayesian Methods for Analysis and Adaptive Scheduling of Exoplanet Observations

    We describe work in progress by a collaboration of astronomers and statisticians developing a suite of Bayesian data analysis tools for extrasolar planet (exoplanet) detection, planetary orbit estimation, and adaptive scheduling of observations. Our work addresses analysis of stellar reflex motion data, where a planet is detected by observing the "wobble" of its host star as it responds to the gravitational tug of the orbiting planet. Newtonian mechanics specifies an analytical model for the resulting time series, but it is strongly nonlinear, yielding complex, multimodal likelihood functions; it is even more complex when multiple planets are present. The parameter spaces range in size from few-dimensional to dozens of dimensions, depending on the number of planets in the system and the type of motion measured (line-of-sight velocity, or position on the sky). Since orbits are periodic, Bayesian generalizations of periodogram methods facilitate the analysis. This relies on the model being linearly separable, enabling partial analytical marginalization and reducing the dimension of the parameter space. Subsequent analysis uses adaptive Markov chain Monte Carlo methods and adaptive importance sampling to perform the integrals required for both inference (planet detection and orbit measurement) and information-maximizing sequential design (for adaptive scheduling of observations). We present an overview of our current techniques and highlight directions being explored by ongoing research. 29 pages, 11 figures. An abridged version is accepted for publication in Statistical Methodology, in a special issue on astrostatistics with selected (refereed) papers presented at the Astronomical Data Analysis Conference (ADA VI) held in Monastir, Tunisia, in May 2010. Update corrects equation (3).
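
    The linear separability that underlies the Bayesian periodogram can be seen in the simplest (circular-orbit) case: v(t) = A cos(2πt/P) + B sin(2πt/P) + γ is linear in (A, B, γ) once the period P is fixed, so those parameters can be handled analytically and only P need be scanned. A minimal sketch in R, with simulated data and conditional least squares standing in for the full marginalization:

        ## Circular-orbit radial-velocity model: scan P, solve (A, B, gamma)
        ## conditionally by linear least squares.
        set.seed(4)
        t <- sort(runif(60, 0, 400))                       # observation times (days)
        v <- 12 * sin(2 * pi * t / 61.3 + 0.8) + 3 + rnorm(60, 0, 2)

        periods <- seq(20, 200, by = 0.1)
        rss <- sapply(periods, function(P) {
          X <- cbind(1, cos(2 * pi * t / P), sin(2 * pi * t / P))
          sum(lm.fit(X, v)$residuals^2)                    # conditional fit given P
        })
        plot(periods, -rss, type = "l", xlab = "period (days)",
             ylab = "periodogram-like score")
        periods[which.min(rss)]                            # recovers ~61.3 days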

    A tutorial on Bayesian multi-model linear regression with BAS and JASP

    Linear regression analyses commonly involve two consecutive stages of statistical inquiry. In the first stage, a single ‘best’ model is defined by a specific selection of relevant predictors; in the second stage, the regression coefficients of the winning model are used for prediction and for inference concerning the importance of the predictors. However, such second-stage inference ignores the model uncertainty from the first stage, resulting in overconfident parameter estimates that generalize poorly. These drawbacks can be overcome by model averaging, a technique that retains all models for inference, weighting each model’s contribution by its posterior probability. Although conceptually straightforward, model averaging is rarely used in applied research, possibly due to the lack of easily accessible software. To bridge the gap between theory and practice, we provide a tutorial on linear regression using Bayesian model averaging in JASP, based on the BAS package in R. Firstly, we provide theoretical background on linear regression, Bayesian inference, and Bayesian model averaging. Secondly, we demonstrate the method on an example data set from the World Happiness Report. Lastly, we discuss limitations of model averaging and directions for dealing with violations of model assumptions.
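
    For readers working in R directly, here is a minimal sketch of the BAS workflow the tutorial builds on; the argument names reflect my reading of the BAS documentation and may vary by version, and the built-in mtcars data simply stand in for the tutorial's example.

        ## Bayesian model averaging for linear regression with BAS.
        library(BAS)

        fit <- bas.lm(mpg ~ ., data = mtcars,
                      prior = "JZS",            # Zellner-Siow mixture of g-priors
                      modelprior = uniform())   # uniform prior over model space

        summary(fit)                            # top models, inclusion probabilities
        coef(fit)                               # model-averaged coefficient summaries
        pred <- predict(fit, estimator = "BMA") # model-averaged predictions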

    Association between DNA Damage Response and Repair Genes and Risk of Invasive Serous Ovarian Cancer

    BACKGROUND: We analyzed the association between 53 genes related to DNA repair and p53-mediated damage response and serous ovarian cancer risk using case-control data from the North Carolina Ovarian Cancer Study (NCOCS), a population-based case-control study. METHODS/PRINCIPAL FINDINGS: The analysis was restricted to 364 invasive serous ovarian cancer cases and 761 controls of white, non-Hispanic race. Statistical analysis proceeded in two stages: a screen using marginal Bayes factors (BFs) for 484 SNPs, and a modeling stage in which we calculated multivariate adjusted posterior probabilities of association for 77 SNPs that passed the screen. These probabilities were conditional on subject age at diagnosis/interview, batch, a DNA quality metric, and genotypes of other SNPs, and allowed for uncertainty in the genetic parameterizations of the SNPs and the number of associated SNPs. Six SNPs had Bayes factors greater than 10 in favor of an association with invasive serous ovarian cancer. These included rs5762746 (median per-allele odds ratio (OR) = 0.66; 95% credible interval (CI) = 0.44-1.00) and rs6005835 (median per-allele OR = 0.69; 95% CI = 0.53-0.91) in CHEK2; rs2078486 (median per-allele OR = 1.65; 95% CI = 1.21-2.25) and rs12951053 (median per-allele OR = 1.65; 95% CI = 1.20-2.26) in TP53; rs411697 (median rare-homozygote OR = 0.53; 95% CI = 0.35-0.79) in BACH1; and rs10131 (median rare-homozygote OR not estimable) in LIG4. The six most highly associated SNPs are either predicted to be functionally significant or are in LD with such a variant. The variants in TP53 were confirmed to be associated in a large follow-up study. CONCLUSIONS/SIGNIFICANCE: Based on our findings, further follow-up of the DNA repair and response pathways in a larger dataset is warranted to confirm these results.
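
    As a hedged sketch of what a first-stage screen with marginal Bayes factors can look like, the code below approximates a per-SNP BF from single-SNP logistic regressions using the BIC approximation; the NCOCS analysis itself used a different, fully Bayesian calculation, and the simulated genotypes are placeholders.

        ## BIC-approximated marginal Bayes factor for one SNP (log-additive coding).
        marginal_bf <- function(snp, y) {
          m1 <- glm(y ~ snp, family = binomial())   # association model
          m0 <- glm(y ~ 1,   family = binomial())   # null model
          exp((BIC(m0) - BIC(m1)) / 2)              # BF of association vs. null
        }

        set.seed(5)
        y   <- rbinom(300, 1, 0.3)   # case-control status
        snp <- rbinom(300, 2, 0.2)   # genotype: minor-allele count 0/1/2
        marginal_bf(snp, y)          # screen retains SNPs with large BFs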

    Do serum biomarkers really measure breast cancer?

    Background: Because screening mammography for breast cancer is less effective for premenopausal women, we investigated the feasibility of a diagnostic blood test using serum proteins. Methods: This study used a set of 98 serum proteins and chose diagnostically relevant subsets via various feature-selection techniques. Because of significant noise in the data set, we applied iterated Bayesian model averaging to account for model selection uncertainty and to improve generalization performance. We assessed generalization performance using leave-one-out cross-validation (LOOCV) and receiver operating characteristic (ROC) curve analysis. Results: The classifiers were able to distinguish normal tissue from breast cancer with a classification performance of AUC = 0.82 ± 0.04 with the proteins MIF, MMP-9, and MPO. The classifiers distinguished normal tissue from benign lesions similarly well, at AUC = 0.80 ± 0.05. However, the serum proteins of benign and malignant lesions were indistinguishable (AUC = 0.55 ± 0.06). The classification tasks of normal vs. cancer and normal vs. benign selected the same top feature, MIF, which suggests that the biomarkers indicated inflammatory response rather than cancer. Conclusion: Overall, the selected serum proteins showed moderate ability for detecting lesions. However, they are probably more indicative of secondary effects such as inflammation than specific for malignancy. Funding: United States Department of Defense Breast Cancer Research Program (Grant No. W81XWH-05-1-0292); National Institutes of Health (R01 CA-112437-01; NIH CA 84955).
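
    The validation scheme described (leave-one-out scores summarized by ROC/AUC) is easy to sketch. Below, a plain logistic regression on simulated data stands in for the paper's iterated-BMA classifier, and the AUC is computed from ranks (the Mann-Whitney form); the protein names are reused only as column labels.

        ## Leave-one-out cross-validated scores and a rank-based AUC.
        set.seed(6)
        n <- 120
        x <- matrix(rnorm(n * 3), n, 3,
                    dimnames = list(NULL, c("MIF", "MMP9", "MPO")))
        y <- rbinom(n, 1, plogis(0.8 * x[, 1] + 0.5 * x[, 2]))
        d <- data.frame(y, x)

        scores <- sapply(seq_len(n), function(i) {
          fit <- glm(y ~ ., data = d[-i, ], family = binomial())  # drop case i
          predict(fit, newdata = d[i, ], type = "response")
        })

        r <- rank(scores); n1 <- sum(y == 1); n0 <- sum(y == 0)
        (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)          # AUC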

    [Comment] Redefine statistical significance

    The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on “statistically significant” findings. There has been much progress toward documenting and addressing several causes of this lack of reproducibility (e.g., multiple testing, P-hacking, publication bias, and underpowered studies). However, we believe that a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming discoveries in many fields of science are simply too low. Associating “statistically significant” findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural, and reporting problems. For fields where the threshold for defining statistical significance is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called “significant” but do not meet the new threshold should instead be called “suggestive.” While statisticians have long known the relative weakness of using P ≈ 0.05 as a threshold for discovery, and the proposal to lower it to 0.005 is not new (1, 2), a critical mass of researchers now endorse this change. We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims. We also do not advocate changes to discovery thresholds in fields that have already adopted more stringent standards (e.g., genomics and high-energy physics research; see Potential Objections below). We also restrict our recommendation to studies that conduct null hypothesis significance tests. We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P-values. However, changing the P-value threshold is simple and might quickly achieve broad acceptance.
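
    One calibration behind the proposal can be computed directly: for P < 1/e, the Bayes factor in favor of the null is bounded below by -e · P · log(P) (the Sellke, Bayarri, and Berger bound), so its reciprocal bounds how much evidence a P-value can ever carry against the null. A two-line check in R:

        ## Maximum Bayes factor against the null implied by a P-value,
        ## via the -e * p * log(p) bound (valid for p < 1/e).
        bf_bound <- function(p) 1 / (-exp(1) * p * log(p))
        bf_bound(0.05)    # ~ 2.5 : at most weak evidence against the null
        bf_bound(0.005)   # ~ 14  : substantially stronger evidence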